Introduction

This practical exercise builds upon existing ggplot2 resources.

The first part of the notebook will walk you through the the different elements that make up a plot and highlight tips that will help with practical code implementation of ggplot2. It is a condensed version of an online workshop delivered by a member of the Tidyverse team, Thomas Lin Pedersen, which can be viewed on YouTube.

The second part of the notebook provides a practical framework to develop a consistent design style for yourself or your organisation.

Part 1: The grammar of graphics

ggplot2 is an implementation of the theoretical framework for constructing data visualisations known as the Grammar of Graphics.

Rather than thinking of data visualisations as entirely separate entities, e.g., box plot, line chart, bar chart, scatter plot, the Grammar of Graphics breaks all data visualisations into eight constituent parts, which are layered on top of each other. The theory explains the relationships between these eight parts:

This approach means you are not limited to a set of predefined data visualisations. You can build plots from the bottom up, and tailor them however you prefer.

In the sections below we will use the Grammar of Graphics framework to explore how to create plots with ggplot2.

Dependencies and data sets

This document comes with a list of required libraries: ‘ggplot2’ and ‘dplyr’. We will use two of the build-in data sets ‘diamonds’ and ‘iris’.

The grammar of graphics: Data

DATA / mapping / statistics / scales / geometries / facets / coordinates / theme
  • ggplot() initialises a ggplot object
  • It can be used to declare:
    • the input data frame (data)
    • plot aesthetics (mapping)
  • The data should be a tidy data frame format
  • Use + to combine ggplot2 elements
  • Adding a geometry creates a plot (geometries)
# Data 1
# Data can be added to ggplot(). If set here the data and mapping values will be inherited by subsequent layers.
ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# Data 2
# Another method is to set data in the ggplot() element and the mapping in the geom layer. 
ggplot(data = iris) +
 geom_point(mapping = aes(Sepal.Length, Petal.Length))

# Data 3
# You can also set both the data and the mapping at the geom layer
ggplot() +
  geom_point(data = iris, mapping = aes(x = Sepal.Length, y = Petal.Length))

# Data 4
# It is common for the arguments not be used
ggplot(iris) +
 geom_point(aes(x = Sepal.Length, y = Petal.Length))

Note: to create the plot in the example above you only needed to:
- Specify which data to use
- Select the aesthetic mappings
- Select a geometry
This is because there are a lot of sensible defaults written into the ggplot2 package that handles the other elements of the plot.

The grammar of graphics: Mapping

data / MAPPING / statistics / scales / geometries / facets / coordinates / theme
  • Aesthetic mappings link variables in the data to graphical properties in the geometries.
  • They inform the graphics of which variables should be prepresented by the x and y-axes, whether a variable should be represented by a colour, size or shape etc.
  • Different geoms have different mapping requirements (e.g., geom_point() requires x and y values and geom_histogram() only requires x values). If you are unsure of a geom’s mapping requirements you can check the ‘Aesthetics’ section of the help documentation (e.g., ?geom_point).
# Mapping 1
# You can map colour to your data using the colour aesthetic. 
# We can use an expression to map the colour, as the x and y values are not passed in as strings. Note that a legend is created when you map a variable to colour.  
ggplot(data = iris) +
 geom_point(mapping = aes(x = Sepal.Length, 
                          y = Petal.Length, 
                          colour = Petal.Length >2)) 

# Mapping 2
# It is important to remember the difference between mapping a colour and setting a colour
# This is how you set the colour on a plot
ggplot(data = iris) +
 geom_point(mapping = aes(x = Sepal.Length, 
                          y = Petal.Length),
            colour = "purple")

# Mapping 3
# Some geometries use fill to set the colour inside the graphic
ggplot(data = iris) +
 geom_histogram(mapping = aes(x = Sepal.Length),
            fill = "purple",
            colour = "navy")

# Mapping 4
# This is how **NOT** to set the colour of a plot
# Do you know what is happening here?
ggplot(data = iris) +
 geom_point(mapping = aes(x = Sepal.Length, 
                          y = Petal.Length,
                          colour = "purple"))

Remember: If you want to map a colour from the data put it in aes() if you want to set a colour or size put it outside aes().


Mapping Exercises

Update the code to make the points larger purple triangles and slightly transparent.

ggplot(iris) + 
  geom_point(aes(x = Sepal.Length, y = Petal.Length))


Colour the two distributions in the histogram with different colours.

ggplot(iris) + 
  geom_histogram(aes(x = Petal.Length))


The grammar of graphics: Geometries

data / mapping / statistics / scales / GEOMETRIES / facets / coordinates / theme
  • Geometries take the scale values and plots them according to a specific geometry
  • The same data input can have a vast array of geometric representations
  • For example the scatter plot and line plot can take the same the data but you will get a different output
  • Geometries can be stacked on top of each other based on your code order
# Geometries 1
# In this example you can see that the points sit on top of the line 
ggplot(data = iris) +
  geom_line(mapping = aes(x = Sepal.Length, y = Petal.Length), colour = "pink") +
  geom_point(mapping = aes(x = Sepal.Length, y = Petal.Length)) 

# In this example the line sits on top of the points 
# This is due to the order of the geometries
ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Petal.Length)) +
  geom_line(mapping = aes(x = Sepal.Length, y = Petal.Length), colour = "pink")

The grammar of graphics: Statistics

data / mapping / STATISTICS / scales / geometries / facets / coordinates / theme

The statistics part refers to the calculations that are happening to transform your raw data into the values we want to display on the plot. For example, a bar chart calculates counts and box plots calculate a range of distribution values. - Every geometry has a default statistic. - Every statistic has a default geometry - Both can be used but people are used to thinking in geometries. You visualise bars rather than counts. - You can access the values generated through the stat by using the after_stat() function.

#geom_bar() uses stat_count() by default 
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x=cut)) +
  ggtitle("geom_bar()")

# we get the same plot because the default geom for stat_count() is geom_bar() 
ggplot(data = diamonds) +
  stat_count(mapping = aes(x=cut)) +
  ggtitle("stat_count()")

# but you can update the geom to something else - normally you wouldn't want to override this.
ggplot(data = diamonds) +
  stat_count(mapping = aes(x=cut), geom = 'point') +
  ggtitle("stat_count()")

# You can calculate the statistics yourself and plot the results 
diamonds_counted_by_cut = diamonds %>%
  group_by(cut) %>%
  summarise(count = n())

ggplot(data = diamonds_counted_by_cut) +
  geom_bar(mapping = aes(x=cut, y=count),
           stat = 'identity') + # identity stat doesn't compute anything 
  ggtitle("stat = 'identity")

# ggplot2 provide geometries for the most frequently used combinations of statistics and geometries. For example, geom_col() is geom_bar() with the stat set to "identity".
ggplot(data = diamonds_counted_by_cut) +
  geom_col(mapping = aes(x=cut, y=count)) +
  ggtitle("geom_col()")

# You can access the values generated through the stat by using the after_stat() function.
# Here we will change the counts to percentages
#geom_bar() uses stat_count() by default 
ggplot(data = diamonds) +
  geom_bar(mapping = 
             aes(
               x = cut,
               y = after_stat(100 * count / sum(count))
               )
           ) + labs(y = "Percentage")

# Computed variables are provided in the stat_ documentation
# ?stat_count 
# ?stat_density
# ?stat_boxplot

Statistic Exercise

Use stat_summary() to add a red dot at the mean depth for each group

ggplot(diamonds) + 
  geom_jitter(aes(x = cut, y = depth), width = 0.3)

Hint: You will need to change the default geom of stat_summary()


The grammar of graphics: Scales

data / mapping / statistics / SCALES / geometries / facets / coordinates / theme

Scales map our input data to the graphical output. Scales define how the mapping you specify inside aes() should happen. All mappings have an associated scale even if not specified. For example, when you assign a variable to colour it determines what kind of values they are (discrete or continuous, binned) and scales them accordingly.

# ggplot looks at the Species variable, sees it has discrete values and automatically assigns a colour to the three Species values. It also creates a legend so you can map back to the points. If we don't like these colours we need to add a scale to update the default colours.
ggplot(iris) + 
  geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species))

# If we don't like the colours ggplot assigns we need to add a scale to update the default colours
# The brewer scales provide sequential, diverging and qualitative colour schemes from ColorBrewer. 
# Type: One of seq (sequential), div (diverging) or qual (qualitative)
ggplot(iris) + 
  geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species)) + 
  scale_color_brewer(type = 'qual')

Remember: You can add colours manually but be aware that colours and complex so you need to consider whether your chosen colours represent the data. For example, qualitative schemes should not imply magnitude differences. You should also consider how they are percieved by those with colour deficiencies.

# NICD branded colours 
ggplot(iris) + 
  geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species)) + 
  scale_color_manual(values = c("#002D30", "#4FE18F", "#A361FF")) 

Positional mappings (x and y) also have associated scales. You can use scale to change break points, gridlines and transform the data. Each scale has built-in transformations but the default is normally set as ‘identity’ - do not transform that data. Built-in transformations include “asn”, “atanh”, “boxcox”, “date”, “exp”, “hms”, “identity”, “log”, “log10”, “log1p”, “log2”, “logit”, “modulus”, “probability”, “probit”, “pseudo_log”, “reciprocal”, “reverse”, “sqrt” and “time”.

# scale the continous x and y values
ggplot(iris) + 
  geom_point(aes(x = Petal.Length, y = Petal.Width)) +
  scale_x_continuous(breaks = c(3, 5, 6)) + 
  scale_y_continuous(trans = 'log10')

> Remember: if you get stuck, there is a lot of documentation available for ggplot2.


Scale Exercise

Modify the code below to create a bubble chart (scatterplot with size mapped to a continuous variable) showing cyl with size. Make sure that only the present amount of cylinders (4, 5, 6, and 8) are present in the legend.

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy, colour = class, size = cyl)) + 
  scale_colour_brewer(type = 'qual') +
  #scale_size(breaks = c(4,5,6,8)) #scales area
#scale_radius(breaks = c(4,5,6,8))
scale_size_area(breaks = c(4,5,6,8))

# the eye percieves the area not the radius. 

Hint: The breaks argument in the scale is used to control which values are present in the legend.

The grammar of graphics: Facets

data / mapping / statistics / scales / geometries / FACETS / coordinates / theme
  • Facets divide a plot into subplots based on the values of one or more discrete values.
  • It is a good way to avoid overplotting.
  • They should not be used to combine separate plots.
  • They share an axis, so it makes comparison easy

You may wish to have multiple plotting areas. It is the layout of your subplots and the layout may have some meaning.

# facet_wrap() wraps a 1d sequence of panels into 2d.
ggplot(mpg) +
  geom_point(aes(x=displ, y=hwy)) +
  facet_wrap(~ class)

#facet_grid() forms a matrix of panels defined by row and column faceting variables. It is most useful when you have two discrete variables, and all combinations of the variables exist in the data. If you have only one variable with many levels, try facet_wrap().

ggplot(mpg) +
  geom_point(aes(x=displ, y=hwy)) +
  facet_wrap(year ~ drv)

The grammar of graphics: Coordinates

data / mapping / statistics / scales / geometries / facets / COORDINATES / theme
  • The coordinate system modifies the x and y values
  • The scaled positions are passed through the coordinate system. You often don’t think about it as we are used to the using the cartesian coordinate system but if you work in mapping you will be more familiar with other coordinate systems..
  • Limits and transformation can be applied in scale or in coord
    • scale will apply it in the beginning
    • coord will apply it at the end
    • You usually want coord if you want to update axes. If you limit the axis with scale it will remove values outside of your scale, whereas if you scale with coord_cartesian you are properly zooming - you are not affecting the data

Other coordinate systems
- coord_cartesian: most plots are drawn in a cartesian coordinate system
- coord_polar: interprets x and y as radius and angle

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x=cut)) +
  coord_polar()

Remember: remember to us coord_cartesian() rather than scale when you don’t want to affect your data.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x=cut))# +

  #scale_y_continuous(limits = c(0, 12500))
  #coord_cartesian(ylim=c(0, 12500))

The grammar of graphics: Theme

data / mapping / statistics / scales / geometries / facets / coordinates / THEME
  • Stylistic changes to the plot not related to the data.
  • In ggplot2 theme() is hierarchical in nature
  • Part 2 of the workshop covers theme() in detail.

Part 2: Designing a consistent design style using theme()

Why focus on styling? Decision making is the key competency in data visualisation. We need to make effective decisions efficiently. We can use a design process to facilitate effective decision making and we can adopt a standard theme, based upon good design principals, to streamline the final presentation work.

theme() arguments

# this code will not run
theme(
  line,
  rect,
  text,
  title,
  aspect.ratio,
  axis.title,
  axis.title.x,
  axis.title.x.top,
  axis.title.x.bottom,
  axis.title.y,
  axis.title.y.left,
  axis.title.y.right,
  axis.text,
  axis.text.x,
  axis.text.x.top,
  axis.text.x.bottom,
  axis.text.y,
  axis.text.y.left,
  axis.text.y.right,
  axis.ticks,
  axis.ticks.x,
  axis.ticks.x.top,
  axis.ticks.x.bottom,
  axis.ticks.y,
  axis.ticks.y.left,
  axis.ticks.y.right,
  axis.ticks.length,
  axis.ticks.length.x,
  axis.ticks.length.x.top,
  axis.ticks.length.x.bottom,
  axis.ticks.length.y,
  axis.ticks.length.y.left,
  axis.ticks.length.y.right,
  axis.line,
  axis.line.x,
  axis.line.x.top,
  axis.line.x.bottom,
  axis.line.y,
  axis.line.y.left,
  axis.line.y.right,
  legend.background,
  legend.margin,
  legend.spacing,
  legend.spacing.x,
  legend.spacing.y,
  legend.key,
  legend.key.size,
  legend.key.height,
  legend.key.width,
  legend.text,
  legend.text.align,
  legend.title,
  legend.title.align,
  legend.position,
  legend.direction,
  legend.justification,
  legend.box,
  legend.box.just,
  legend.box.margin,
  legend.box.background,
  legend.box.spacing,
  panel.background,
  panel.border,
  panel.spacing,
  panel.spacing.x,
  panel.spacing.y,
  panel.grid,
  panel.grid.major,
  panel.grid.minor,
  panel.grid.major.x,
  panel.grid.major.y,
  panel.grid.minor.x,
  panel.grid.minor.y,
  panel.ontop,
  plot.background,
  plot.title,
  plot.title.position,
  plot.subtitle,
  plot.caption,
  plot.caption.position,
  plot.tag,
  plot.tag.position,
  plot.margin,
  strip.background,
  strip.background.x,
  strip.background.y,
  strip.placement,
  strip.text,
  strip.text.x,
  strip.text.y,
  strip.switch.pad.grid,
  strip.switch.pad.wrap,
  ...,
  complete = FALSE,
  validate = TRUE
)

Element functions

theme() constructs plots as a combination of:

  • Rectangles
  • Lines
  • Text
  • Empty space

The element_ functions specify the display of how non-data components of the plot are drawn.

  • element_blank() draws nothing, and assigns no space
  • element_rect() borders and backgrounds
  • element_line() lines
  • element_text() text

Rectangles

element_rect()

# this code will not run
element_rect(
  fill = NULL,
  colour = NULL,
  size = NULL,
  linetype = NULL,
  color = NULL,
  inherit.blank = FALSE
)
  • plot background
  • panel background
  • legend.key
  • legend background
  • legend box background
# Colours
  dark_green <- "#002D30"
  light_grey <- "#F2F2F2"
  green_blue <- "#007883"
  light_green <- "#4FE18F"
  blue <- "#00C3D6"
  pink <- "#FF709D"
  purple <- "#A361FF"
  orange <- "#ffa45a"
# Font  
#  font <- "Derailed"

# The plot
ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  labs(title = "Differences in flower size by iris species",
       subtitle = "Plot of petal width by petal length",
       caption = "Data source: R") +

  # Theme  
  theme(
    # You can use this to remove all lines from the plot
    #rect = element_blank(),
    
    # Plot rectangle format:
    plot.background = element_rect(fill = pink),
    
    # Panel rectangle format:
    panel.background = element_rect(fill = dark_green),
    #panel.border = element_rect(fill = light_grey, colour = pink), # you probably will never need to use this
    
    legend.key = element_rect(fill = "white"),
    legend.background = element_rect(fill = orange),
    legend.margin = margin(9,9,9,9),
    
    legend.box.background = element_rect(fill = purple, colour = purple),
    legend.box.margin = margin(9,9,9,9),
    )

Lines

element_line()

# this code will not run
element_line(
   colour = NULL,
   size = NULL,
   linetype = NULL,
   lineend = NULL,
   color = NULL,
   arrow = NULL,
   inherit.blank = FALSE
)
  • axis ticks
  • axis line x
  • axis line y
  • panel grid minor
  • panel grid major
# Colours
  dark_green <- "#002D30"
  light_grey <- "#F2F2F2"
  green_blue <- "#007883"
  light_green <- "#4FE18F"
  blue <- "#00C3D6"
  pink <- "#FF709D"
  purple <- "#A361FF"
  orange <- "#ffa45a"
# Font  
#  font <- "Derailed"

# The plot
ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  labs(title = "Differences in flower size by iris species",
       subtitle = "Plot of petal width by petal length",
       caption = "Data source: R") +

  # Theme  
  theme(
    # You can use this to remove all lines from the plot
    #line = element_blank(),
    
    # Update the axis ticks
    axis.ticks = element_line(colour = pink, lineend = "round", size = 2),
    axis.ticks.length = unit(.5, "cm"),
    
    # Update the axis lines
    axis.line.x = element_line(colour = orange, lineend = "round", size = 2),
    axis.line.y = element_line(colour = light_green, lineend = "round", size = 2),
    
    # Update the grid lines
    #panel.grid = element_line(colour = pink),
    panel.grid.minor = element_line(colour = purple),
    panel.grid.major = element_line(colour = blue)
    )

Text

element_text()

# this code will not run
element_text(
  family = NULL,
  face = NULL,
  colour = NULL,
  size = NULL,
  hjust = NULL,
  vjust = NULL,
  angle = NULL,
  lineheight = NULL,
  color = NULL,
  margin = NULL,
  debug = NULL,
  inherit.blank = FALSE
)
  • plot title
  • plot subtitle
  • plot caption
  • legend title
  • legend text
  • axis title x
  • axis title y
  • axis text
  • strip text
# Colours
  dark_green <- "#002D30"
  light_grey <- "#F2F2F2"
  green_blue <- "#007883"
  light_green <- "#4FE18F"
  blue <- "#00C3D6"
  pink <- "#FF709D"
  purple <- "#A361FF"
  orange <- "#ffa45a"
# Font  
# font <- "Derailed"

# The plot
ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  labs(title = "Differences in flower size by iris species",
       subtitle = "Plot of petal width by petal length",
       caption = "Data source: R") +

  # Theme  
  theme(
    # Title text format:
    # Set the font family and font colour 
    #text = element_text(family=font,
    #                    color=purple),
    
    # Make additional changes (e.g., size, type, margins) to the plot's title, subtitle and caption and the legend title and text here. 
    plot.title = element_text(size=18,
                              face="bold",
                              colour = purple
                              ),
    
    plot.subtitle = element_text(size=12,
                                 margin=margin(9,0,9,0),
                                 colour = blue
                                 ),
    
    plot.caption = element_text(size=10,
                                colour = pink),
    
    # Legend text format:
    legend.title = element_text(size=12, 
                                face = "bold",
                                colour = orange),
    
    legend.text = element_text(size=12, 
                               colour = dark_green),
    
    # Axis text format:
    axis.title.y = element_text(size=12, 
                              colour = light_green),
    axis.title.x = element_text(size=12, 
                              colour = blue),
    # SET AS BLANK?
    
    # Notice that the font is pulled through in the text hierarchy but colour is not for axis.text or strip.text
    axis.text = element_text(size=10,
                             colour = green_blue),
  
    # Strip text 
    strip.text = element_text(size=12, 
                              hjust = 0,
                              colour = orange)
    )

Exercise

For this exercise we will work with the following example plot.

ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  facet_grid(cols = vars(Species)) + 
  labs(title = "Differences in flower size by iris species",
       subtitle = "Plot of petal width by petal length",
       caption = "Data source: R")


Task:

ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  facet_grid(cols = vars(Species)) + 
  labs(title = "Differences in flower size by iris species",
       subtitle = "Plot of petal width by petal length",
       caption = "Data source: R") +
  theme(
    
    # RECTANGLES
    # element_rect()
    #plot.background = element_rect(),
    #panel.background = element_rect(),
    
    #legend.key = element_rect(),
    #legend.background = element_rect(),
    #legend.margin = margin(),
    #legend.position(),
    #legend.direction(),
    #legend.justification(),
    
    #legend.box.background = element_rect(),
    #legend.box.margin = margin(),
    
    #panel.spacing(),
    
    # LINES
    #axis.ticks = element_line(),
    #axis.ticks.length = unit(.5, "cm"),

    #axis.line = element_line()
    #axis.line.x = element_line(),
    #axis.line.y = element_line(),
    
    #panel.grid = element_line(),
    #panel.grid.minor = element_line(),
    #panel.grid.major = element_line(),
    
    # TEXT
    #text = element_text(),
    #plot.title = element_text(),
    #plot.subtitle = element_text(),
    #plot.caption = element_text(),
  
    #legend.title = element_text(),
    #legend.title.align(),  
    #legend.text = element_text(),
    #legend.text.align(),   

    
    #axis.title = element_text(),
    #axis.title.y = element_text(),
    #axis.title.x = element_text(),
    
    #axis.text = element_text(),
  
    #strip.text = element_text()
    
    )